inequality use
Adaptive Calibration in Non-Stationary Environments
Liu, Junyan, Luo, Haipeng, Ratliff, Lillian J.
Making calibrated online predictions is a central challenge in modern AI systems. Much of the existing literature focuses on fully adversarial environments where outcomes may be arbitrary, leading to conservative algorithms that can perform suboptimally in more benign settings, such as when outcomes are nearly stationary. This gap raises a natural question: can we design online prediction algorithms whose calibration error automatically adapts to the degree of non-stationarity in the environment, smoothly interpolating between i.i.d. and adversarial regimes? We answer this question in the affirmative and develop a suite of algorithms that achieve adaptive calibration guarantees under multiple calibration measures. Specifically, with $T$ being the number of rounds, $K$ being the unknown number of i.i.d. segments of the environment, and $C\in[0,T]$ being another unknown non-stationary measure defined as the minimal $\ell_1$ deviation of the mean outcomes, our algorithms attain $\widetilde{O}(\min\{\sqrt{T}+(TC)^{\frac{1}{3}}, \sqrt{KT}\})$ for $\ell_1$ calibration error and $\widetilde{O}(\min\{(1+C)^{\frac{1}{3}}, K\})$ for both $\ell_2$ and pseudo KL calibration error. These bounds match the optimal rates in the stationary case ($C=0$ and $K=1$) and recover known guarantees in the fully adversarial regime ($C, K=ฮฉ(T)$). Our approach builds on and extends prior work [Hu et al., 2026, Luo et al., 2025], introducing an epoch-based scheduling together with a novel non-uniform partition of the prediction space that allocates finer resolution near the underlying ground truth.
Decentralized Randomly Distributed Multi-agent Multi-armed Bandit with Heterogeneous Rewards
We study a decentralized multi-agent multi-armed bandit problem in which multiple clients are connected by time dependent random graphs provided by an environment. The reward distributions of each arm vary across clients and rewards are generated independently over time by an environment based on distributions that include both sub-exponential and sub-Gaussian distributions. Each client pulls an arm and communicates with neighbors based on the graph provided by the environment. The goal is to minimize the overall regret of the entire system through collaborations. To this end, we introduce a novel algorithmic framework, which first provides robust simulation methods for generating random graphs using rapidly mixing Markov chains or the random graph model, and then combines an averaging-based consensus approach with a newly proposed weighting technique and the upper confidence bound to deliver a UCB-type solution. Our algorithms account for the randomness in the graphs, removing the conventional doubly stochasticity assumption, and only require the knowledge of the number of clients at initialization. We derive optimal instance-dependent regret upper bounds of order logT in both sub-Gaussian and sub-exponential environments, and a nearly optimal mean-gap independent regret upper bound of order T logT up to a logT factor. Importantly, our regret bounds hold with high probability and capture graph randomness, whereas prior works consider expected regret under assumptions and require more stringent reward distributions.
Appendix Algorithm 2 Vanilla Pessimistic Value Iteration
Absolute Constant C, failure probability ฮด. 2: Initialization: Set bVH+1() 0. 3: for time h= H,H 1,...,1 do 4: Set bQh(,) brh(,) + ( bPh bVh+1)(,) 5: sh,ah, set ฮh(sh,ah) = The upper bound of OPE comes from Yin and Wang [2020], Duan et al. [2020] and the Cramer-Rao lower bound comes from Jiang and Li [2016]. Table A shows the statistical optimality for offline policy learning and offline policy evaluation (OPE) in the non-stationary tabular MDPs. By Cauchy-Schwartz inequality, it can be checked that the rate between the two bounds (roughly) deviate by a factor of H (in terms of sample complexity), and this reveals that offline learning is inherently harder than OPE from the statistical aspect. Our analysis of the intrinsic learning bound in Section 4 leverage the key design feature of APVI that bVh+1 only depends on the transition data from time h+ 1 to H while bPh only uses transition pairs at time h. This enables concentration inequalities due the conditional independence.7 Especially, to recover the q VarP(V?h+1) structure to we Next, we use (3) as the intermediate step to crude bounding ||bVh+1 V?h+1|| . Lastly, we can combine this with the extended value difference lemma in Cai et al. [2020] to bound V?1 bV1 and leverage the pessimistic design for bounding bV1 Vbฯ1 .